Take Home Exercise 1

Visual Analytics of the demographic and financial characteristics of residents in City of Engagement

Author

Oh Jia Wen

Published

May 14, 2023

Modified

May 14, 2023

1. OVERVIEW

City of Engagement is a small city located at Country of Nowhere, with a total population of 50,000, serving as an agriculture region. The local council of the city is in the midst of preparing the Local Plan 2023.

1.1 The Task

In this take-home exercise, patterns containing demographics and financial characteristics of residents in City of Engagement will be unveiled using appropriate static and interactive statistical graphics methods.

2. Datasets

Data has been collected by the local council of the city. The survey sampled 1,000 respondents to collect data related to their household demographic, spending patterns and among, other things. The data is stored in two separate files with FinancialJournal.csv ranging from Mar 2022 to Feb 2023 :

1) Participants.csv 2) FinancialJournal.csv
Rows 1,011 1,513,636
Variables 7 4

2.1 Metadata

File Variables Name Description

participants.csv

FinancialJourval.csv

participantId Unique identification to represent the Participants
participants.csv householdSize Represents the number of people in the household
participants.csv haveKids Binary value (True/False) indicating if participant have a kid(s)
participants.csv age Represents age of the participant
participants.csv educationLevel Represents the highest education attained by participant
FinancialJourval.csv interestGroup Represents the group associated with the participant
FinancialJourval.csv joviality Represents the level of happiness by participant at the start of the survey
FinancialJourval.csv timestamp Represents the date and time the entry was inputted
FinancialJourval.csv category Represents the type of income/expenses incurred at a given timestamp
FinancialJourval.csv amount Represents the amount received ( + income), amount paid ( - expenses )

3. Data Preparation

3.1 Install R-packages

Using p_load() of pacman package to load and install the following libraries:

  • ggiraph : For creating interactive ‘ggplot’ graphics

  • plotly : For creating interactive statistical graphs

  • patchwork : For combining multiple ggplot2 graphs into one figure

  • tidyverse : A collection of R packages use in everyday data analyses. It is able to support data science, data wrangling, and analysis.

  • knitr: For dynamic report generation

  • ggstatsplot: For creating graphics with details from statistical tests included and its plot

  • paletteer: Collection of color palettes

  • wesanderson: Wes Anderson’s theme Palette Generator

  • scales: For customizing the appearance of axis and legend labels

  • png: For reading, writing, and displaying bitmap imaged in PNG format

  • nortest: For normality test

  • webshot2: For taking screenshot of web pages

  • rstatix: For performing statistical tests and correlation analyses

  • gt : For constructing of table

  • stats : For statistical calculations

  • gridExtra: For arranging multiple grid-based plot on a page, and draw tables

  • ggpubr: For creating and customizing ‘ggplot2’ based plots

  • ggdist: For visualization of distributions and uncertainty

pacman::p_load(ggiraph, plotly, patchwork, tidyverse,
               knitr,ggstatsplot,paletteer,wesanderson,
               scales, ggpubr, rstatix,gt,webshot2,
               png,nortest, stats, gridExtra,ggdist)

options(scipen = 999)
Tip

options(scipen = 999) : removes scientific notation in our exercise.

3.2 Import Data

3.2.1 Import participants dataset

participants <- read_csv("data/Participants.csv")

3.2.2 Load participants

# A tibble: 6 × 7
  participantId householdSize haveKids   age educationLevel      interestGroup
          <dbl>         <dbl> <lgl>    <dbl> <chr>               <chr>        
1             0             3 TRUE        36 HighSchoolOrCollege H            
2             1             3 TRUE        25 HighSchoolOrCollege B            
3             2             3 TRUE        35 HighSchoolOrCollege A            
4             3             3 TRUE        21 HighSchoolOrCollege I            
5             4             3 TRUE        43 Bachelors           H            
6             5             3 TRUE        32 HighSchoolOrCollege D            
# ℹ 1 more variable: joviality <dbl>
head(participants)

3.2.3 Import Financial Journal dataset

financial_journal <- read_csv("data/FinancialJournal.csv")

3.2.4 Load Financial Journal

# A tibble: 6 × 4
  participantId timestamp           category  amount
          <dbl> <dttm>              <chr>      <dbl>
1             0 2022-03-01 00:00:00 Wage      2473. 
2             0 2022-03-01 00:00:00 Shelter   -555. 
3             0 2022-03-01 00:00:00 Education  -38.0
4             1 2022-03-01 00:00:00 Wage      2047. 
5             1 2022-03-01 00:00:00 Shelter   -555. 
6             1 2022-03-01 00:00:00 Education  -38.0
head(financial_journal)

3.3 Data Wrangling

As seen from the two data tables above, there are some issues that could be rectify. Henceforth, the following adjustments are made:

3.3.1 participants.csv :

  1. participantId is a <dbl> variable. (Rectify by reformatting it to <chr>)

  2. householdSize is a <dbl> variable. (Revised to <ord> for the order of categories)

  3. age is a continuous variable which makes it harder to visualize the demographics (Create a new column with 5-class variables after determining the youngest and oldest demographics.)

Show the code
#check min and max age of residents in COE. 
min(participants$age)
[1] 18
Show the code
max(participants$age)
[1] 60
  1. educationLevel is a <chr> variable. (Revised to <ord> for the order of categories)

  2. joviality has nine decimal places. (Rectify by rounding it to 2.d.p and create new 5-class variables for future analysis)

Show the code
#create new dataset 
participants_new <- participants %>%
    mutate(
          participantId = as.character(participantId),
          #binned joviality to 5-class variables 
          joviality_bins = cut(joviality, breaks = c(0.0,0.2,0.4,0.6,0.8,1.0))
          )

#reformat householdSize to Ordinal 
    participants_new$householdSize <- factor(participants$householdSize,
                                      levels = c("1", "2", "3"), 
                                      ordered = TRUE) 
#reformat age group 
participants_new$age_group <- factor(ifelse(participants$age < 20, "Under 20",
                ifelse(participants$age < 30, "20-29",
                    ifelse(participants$age < 40, "30-39",
                      ifelse(participants$age < 50, "40-49", "Above 50")))),
                levels = c("Under 20", "20-29", "30-39", "40-49", "Above 50"),
                ordered= TRUE)

#reformat education level to Ordinal 
participants_new$educationLevel <- factor(participants$educationLevel, 
                                      levels = c("Low", "HighSchoolOrCollege", 
                                                 "Bachelors", "Graduate"
                                                 ), 
                                      ordered = TRUE)  

#round up joviality to 2 decimal places 
participants_new$joviality <- round(participants$joviality, 2) 

#output data frame
participants_new
# A tibble: 1,011 × 9
   participantId householdSize haveKids   age educationLevel      interestGroup
   <chr>         <ord>         <lgl>    <dbl> <ord>               <chr>        
 1 0             3             TRUE        36 HighSchoolOrCollege H            
 2 1             3             TRUE        25 HighSchoolOrCollege B            
 3 2             3             TRUE        35 HighSchoolOrCollege A            
 4 3             3             TRUE        21 HighSchoolOrCollege I            
 5 4             3             TRUE        43 Bachelors           H            
 6 5             3             TRUE        32 HighSchoolOrCollege D            
 7 6             3             TRUE        26 HighSchoolOrCollege I            
 8 7             3             TRUE        27 Bachelors           A            
 9 8             3             TRUE        20 Bachelors           G            
10 9             3             TRUE        35 Bachelors           D            
# ℹ 1,001 more rows
# ℹ 3 more variables: joviality <dbl>, joviality_bins <fct>, age_group <ord>

3.3.2 FinancialJourval.csv :

  1. Similar issue as point 1 above.

  2. Timestamp is a <POSIX> variable. (Rectify by reformatting it to <chr> in year-mth format)

  3. As per the code below, there are duplicate entries in the financial journal. (Rectify by using the distinct() function from [dplyr package] in tidyverse)

Show the code
#check for duplicates 
dup <- (nrow(financial_journal) - nrow(unique(financial_journal)))
#reformat output 
dup_reformat <- format(dup, big.mark=",")
#print output
dup_reformat
[1] "1,113"
  1. “Category” is not very useful. (Rectify by using pivot_wider() function from [tidyr package] to transpose)
Show the code
    #remove duplicate rows for all columns
    financial_journal_lessdup <- financial_journal %>% 
      distinct()

    #create new dataset 
    grouped_data <- financial_journal_lessdup %>%
      
    #recode participantId from dbl to chr, format timestamp to year_mth and round amount 
        mutate(participantId = as.character(participantId),
             year_mth = format(as.Date(financial_journal_lessdup$timestamp), "%Y-%m"),
             amount = abs(round(amount,2)),
             .before = 3) %>%
      
    #group the columns in the following order and sum the amount to total_amount
      group_by( participantId, year_mth,category) %>%
      summarize(total_amount = sum(amount)) 

    # pivot the data frame to have categories as columns
    pivoted_fj <- grouped_data %>%
      pivot_wider(names_from = "category", 
                  values_from = "total_amount", values_fill = 0)

    # create new column from list of categories
    pivoted_fj$Shelter_new <- pivoted_fj$Shelter + pivoted_fj$RentAdjustment
    pivoted_fj$Expenses <- pivoted_fj$Education + pivoted_fj$Food +  
    pivoted_fj$Recreation + pivoted_fj$Shelter + pivoted_fj$RentAdjustment
    pivoted_fj$Income <- pivoted_fj$Wage
    pivoted_fj$Cashflow <- pivoted_fj$Income - pivoted_fj$Expenses

    # output the pivoted data frame
    pivoted_fj %>%
      select(c(1:5,7,9:12))
# A tibble: 10,691 × 10
# Groups:   participantId, year_mth [10,691]
   participantId year_mth Education  Food Recreation   Wage Shelter_new Expenses
   <chr>         <chr>        <dbl> <dbl>      <dbl>  <dbl>       <dbl>    <dbl>
 1 0             2022-03       38.0  268.      349.  11932.        555.    1210.
 2 0             2022-04       38.0  266.      219.   8637.        555.    1078.
 3 0             2022-05       38.0  265.      383.   9048.        555.    1241.
 4 0             2022-06       38.0  257.      466.   9048.        555.    1316.
 5 0             2022-07       38.0  270.     1069.   8637.        555.    1933.
 6 0             2022-08       38.0  262.      314.   9459.        555.    1169.
 7 0             2022-09       38.0  256.      295.   9048.        555.    1144.
 8 0             2022-10       38.0  267.       25.0  8637.        555.     885.
 9 0             2022-11       38.0  261       377.   9048.        555.    1231.
10 0             2022-12       38.0  266.      357.   9048.        555.    1216.
# ℹ 10,681 more rows
# ℹ 2 more variables: Income <dbl>, Cashflow <dbl>
  1. Type of Income/Expenses are all labelled in a column. (Create new columns)

    Data Table Variables Name Description
    resident_profile_rev Income Category :Wage
    resident_profile_rev Expenses Category: Education + Recreation + Food + Shelter_new (Shelter + RentAdjustment)
    resident_profile_rev Cashflow NEW : Income - Expenses
  2. Multiple zero values for RentAdjustment. As seen from the data frame above and the code chunk below, there are only 72 rows. Thus, it will be combined with Shelter.

colSums(pivoted_fj[-1] !=0)
      year_mth      Education           Food     Recreation        Shelter 
         10691           3018          10691           9492          10560 
          Wage RentAdjustment    Shelter_new       Expenses         Income 
         10691             72          10560          10691          10691 
      Cashflow 
         10691 

3.4 Merging of Data frame

Full_join will be used to create a new table by joining the cleaned participants dataset and pivoted financial journal dataset. It will be match by participant’s ID. Likewise, the sequence in the dataset will be relocated to highlight several columns.

Show the code
#join both data sets 
resident_profile <- full_join(participants_new, pivoted_fj, 
                       by = c("participantId" = "participantId")) %>%
#relocate columns to the front (by importance)
                    relocate(year_mth, .after =participantId) %>%
                    relocate(Cashflow, .after = year_mth) %>%
                    relocate(age_group, .after = Cashflow) %>%
                    relocate(educationLevel, .after = age_group) %>%
                    relocate(Income, .after = haveKids) %>%
                    relocate(Expenses , .after = Income) 
resident_profile %>%
    select(c(1:18))
# A tibble: 10,691 × 18
   participantId year_mth Cashflow age_group educationLevel      householdSize
   <chr>         <chr>       <dbl> <ord>     <ord>               <ord>        
 1 0             2022-03    10722. 30-39     HighSchoolOrCollege 3            
 2 0             2022-04     7559. 30-39     HighSchoolOrCollege 3            
 3 0             2022-05     7808. 30-39     HighSchoolOrCollege 3            
 4 0             2022-06     7733. 30-39     HighSchoolOrCollege 3            
 5 0             2022-07     6704. 30-39     HighSchoolOrCollege 3            
 6 0             2022-08     8291. 30-39     HighSchoolOrCollege 3            
 7 0             2022-09     7904. 30-39     HighSchoolOrCollege 3            
 8 0             2022-10     7752. 30-39     HighSchoolOrCollege 3            
 9 0             2022-11     7817. 30-39     HighSchoolOrCollege 3            
10 0             2022-12     7832. 30-39     HighSchoolOrCollege 3            
# ℹ 10,681 more rows
# ℹ 12 more variables: haveKids <lgl>, Income <dbl>, Expenses <dbl>, age <dbl>,
#   interestGroup <chr>, joviality <dbl>, joviality_bins <fct>,
#   Education <dbl>, Food <dbl>, Recreation <dbl>, Shelter <dbl>, Wage <dbl>

3.4.1 Entries Check

To ensure data accuracy, the code chunk below checks the completeness of the data. Given that the data has a time period of one year, the code examines if the participants have entries for the entire time period.

Show the code
#check if participants_id have entries for the entire year 
participant_counts <- resident_profile %>%
  group_by(participantId) %>%
  summarise(num_months = n_distinct(year_mth)) %>%
  ungroup()

filtered_count <- participant_counts %>%
  filter(num_months != 12) %>%
  nrow()

filtered_count 
[1] 131

It has been observed that there are 131 participants who do not have entries for the entire time period. To avoid inaccuracy, these group of participants will be excluded from the analysis.

3.4.2 Missing Values Check

Through the code chunk below, we confirmed that there are no missing values in resident_profile dataset.

#Check for missing values
any(is.na(resident_profile))
[1] FALSE

3.4.3 Revised Resident’s Profile Dataset

Given that we have removed duplicates in section 3.3.2, removed entries in section 3.4.1 and observed no missing values in section 3.4.2, the resident’s profile data set have been revised. We will be using the knitr: kable() function to display the final dataset.

Show the code
#create a revised dataframe to exclude id that do not have entries for the time period
resident_profile_rev <- resident_profile %>%
  group_by(participantId) %>%
  mutate(num_months = n_distinct(year_mth)) %>%
  ungroup() %>%
  filter(num_months == 12) %>%
  select(-num_months, -Shelter, -RentAdjustment, -Wage)

#output for dataframe using knitr:: kable
kable(head(resident_profile_rev), "simple") 
participantId year_mth Cashflow age_group educationLevel householdSize haveKids Income Expenses age interestGroup joviality joviality_bins Education Food Recreation Shelter_new
0 2022-03 10722.01 30-39 HighSchoolOrCollege 3 TRUE 11931.95 1209.94 36 H 0 (0,0.2] 38.01 268.26 348.68 554.99
0 2022-04 7558.67 30-39 HighSchoolOrCollege 3 TRUE 8636.88 1078.21 36 H 0 (0,0.2] 38.01 265.79 219.42 554.99
0 2022-05 7807.63 30-39 HighSchoolOrCollege 3 TRUE 9048.16 1240.53 36 H 0 (0,0.2] 38.01 264.54 382.99 554.99
0 2022-06 7732.59 30-39 HighSchoolOrCollege 3 TRUE 9048.16 1315.57 36 H 0 (0,0.2] 38.01 256.90 465.67 554.99
0 2022-07 6704.27 30-39 HighSchoolOrCollege 3 TRUE 8636.88 1932.61 36 H 0 (0,0.2] 38.01 270.13 1069.48 554.99
0 2022-08 8290.55 30-39 HighSchoolOrCollege 3 TRUE 9459.44 1168.89 36 H 0 (0,0.2] 38.01 261.76 314.13 554.99

4. Exploratory Data Visualization

In this section, we will design plots with interactivity for users to study and explore the data. The plots are created with the use of giraph , plotly, and patchwork packages.

4.1 Interactive Dashboard

A dashboard is created to provide an overview of the demographics of residents in City of Engagement across age group. Bar chart is chosen to show segments of information by comparing different categorical variables. A design layout is included in the code to better visualized the output through patchwork. Moreover, tooltip is used to highlight the specific age group at the point of the data.

Show the code
#Create new df by doing a full join with participants_counts
participants_new_rev <- full_join(
  participants_new, participant_counts, 
  by = c("participantId" = "participantId"))  %>%
  group_by(participantId) %>%
  #participants_counts contains the number of months the participants entries are in
  ungroup() %>%
  #filter away participantsId who do not have entries for the full year
  filter(num_months == 12) %>%
  select(-num_months)

#create tooltip to display age group 
participants_new_rev$tooltip <-c(paste0(
  "Age Group:", participants_new_rev$age_group))

#Bar chart for resident's age distribution
p1 <- ggplot(data= participants_new_rev,
      aes(x = age_group)) +
      geom_bar_interactive(aes(tooltip = participants_new_rev$tooltip, 
                               stackgroups = TRUE,
                               data_id= age_group)) + 
      scale_fill_manual(values = wes_palette("Chevalier1")) +
      xlab("Age Group") +
      ylab("No. of participants") +
      theme(axis.text.x=element_text(size=5)) +
      theme(axis.title.y=element_text(size=10)) +
      ylim(0,250)

#Bar chart for resident's household size distribution
p2 <- ggplot(data= participants_new_rev,
      aes(x = householdSize)) +
      geom_bar_interactive(aes(tooltip = participants_new_rev$tooltip, 
                               stackgroups = TRUE,
                               data_id= age_group)) +
      scale_fill_manual(values = wes_palette("Chevalier1")) +
      xlab("Size of Household") +
      ylab("") +
      theme(axis.text.x=element_text(size=5)) +
      theme(axis.title.y=element_text(size=10)) +
      ylim(0,350)


#Bar chart for resident's education level
p3 <- ggplot(data= participants_new_rev,
      aes(x = educationLevel)) +
      geom_bar_interactive(aes(tooltip = participants_new_rev$tooltip, 
                               stackgroups = TRUE,
                               data_id= age_group)) +
      xlab("Education Level") +
      ylab("") +
      theme(axis.text.x=element_text(size=5)) +
      theme(axis.title.y=element_text(size=10)) +
      ylim(0,350)

#Bar chart to visualize if residents have kids 
p4 <- ggplot(data= participants_new_rev,
      aes(x = haveKids)) +
      geom_bar_interactive(aes(tooltip = participants_new_rev$tooltip, 
                               stackgroups = TRUE,
                               data_id= age_group)) +
      ylab("") +
      theme(axis.text.x=element_text(size=5)) +
      theme(axis.title.y=element_text(size=10)) +
      ylim(0,400)

#Bar chart for residents' interest group 
p5 <- ggplot(data= participants_new_rev,
      aes(x = interestGroup)) +
      geom_bar_interactive(aes(tooltip = participants_new_rev$tooltip, 
                               stackgroups = TRUE,
                               data_id= age_group)) +
      xlab("Interest Group") +
      ylab("No. of participants") +
      theme(axis.text.x=element_text(size=5)) +
      theme(axis.title.y=element_text(size=10)) +
      ylim(0,120)

#Bar chart for residents' joviality in bins 
p6 <- ggplot(data= participants_new_rev,
      aes(x = joviality_bins)) +
      geom_bar_interactive(aes(tooltip = participants_new_rev$tooltip, 
                               stackgroups = TRUE,
                               data_id= age_group)) +
      xlab("Joviality") +
      ylab("No. of participants") +
      theme(axis.text.x=element_text(size=5)) +
      theme(axis.title.y=element_text(size=10)) +
      ylim(0,250)

#design layout for the patchwork figure
design <- "
  132
  132
  554
  554
  666
  666
"

girafe(code = print(p1 + p2 + p3 + p4 + p5  + p6 +
                      plot_layout(design = design,) + 
                      plot_annotation(title = 
                    "Demographics Insights of residents in City of Engagement",
                     theme = theme(plot.title = element_text(size = 20, hjust=0.5))
                      )), 
       width_svg = 12,
       height_svg = 6,
       options = list(
         opts_hover(css = "fill: #02401B;"),
         opts_hover_inv(css = "opacity:0.2;")
         )
       ) 

Observations:

  • The city might be facing an aging population as age group is left-skewed.
  • Low interest for future education as education level is right-skewed.
  • Small family size with higher proportion of participants of not having Kids.
  • Uniformly distributed Interest Group. No preference among age group.
  • Joviality level seems to be decreasing at a decreasing rate
Interactivity

Click on the graph and hover around each demographics.

The respective age group will be displayed.

4.2 Financial Health of Participants

To know more about the financial health of the participants, interactive geom_point is used to plot against the time period. tooltip is included to create a snapshot of the financial health status of the participants at the time period.

scale_color_manual of [ggplot2] is used to differentiate between positive and negative cash flow. In addition, the plot contains hover effect with the use of data_id aesthetic to highlight the trend of the participant’s cash flow.

Show the code
#tooltip output to display ID, Cashflow, Income, and Expenses
resident_profile_rev$tooltip <- paste0(
  "Participant's ID = ", resident_profile_rev$participantId,
  "\n Cashflow = ", format(resident_profile_rev$Cashflow, big.mark = ","),
  "\n Income = ", format(resident_profile_rev$Income, big.mark = ","),
  "\n Expenses = ", format(resident_profile_rev$Expenses, big.mark = ",")
                                  )
#tool_tip design
tooltip_css <- "background-color: lightgrey; #<<
font-style:bold; color: #446455;" #<<

ie <-  ggplot(data=resident_profile_rev) +
       geom_point_interactive(aes(x = year_mth, y = Cashflow,
                                   tooltip = resident_profile_rev$tooltip,
                                   data_id = participantId,
                                   #if Cashflow >0 = Green, else Red
                                   color = ifelse(Cashflow >= 0, 
                                                  "Above 0", "Below 0")
                                   )) +
        scale_color_manual(values = c("Above 0" = "#446455", 
                                      "Below 0" = "#C93312")) +
        #remove legend title 
      labs(color = "") +
      labs(title="Financial Health of Participants from Mar 2022 to Feb 2023") +
      ylab("Cashflow ($)") + xlab("Year-month") +  
      scale_y_continuous(labels = comma_format()) +
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
      theme_minimal()        |>

girafe(                                  
  ggobj = ie,                             
  width_svg = 6,                         
  height_svg = 6*0.618,
  options = list( #<<
    opts_tooltip(css = tooltip_css), #<<
    opts_hover_inv(css = "opacity:0.1;") #<<
  )                                        
)   

Observations:

  • Majority of participants have positive cash flow (Good Credit Rating)
  • Negligible growth on cash flow
  • No drastic changes in participant’s cash flow over time

4.3 The Circulation of Money

As illustrated earlier, the cash flow is a holding transaction. It is holded by the individual after repaying the necessary debt/expenses. To build more wealth, it is ideal for Wage to increase or for expenses to decrease.

With that in mind, a line scatter pot is used to identify the trend in various categories over a time period. The participants dataset have been revised to exclude participants who did not fulfill the entries check. It has been reformatted before executing a full_join with participants_count data set. The joined table concatenates the count of the year-mth period (num_months) that is not available in the dataset and filters it away.

The plot_ly() graph below shows the total amount circulated by the participants.

Show the code
#to reformat partcipantId, create year-mth, round off amount to 2.d.p
financial_journal_lessdup <- financial_journal_lessdup %>%
  mutate(participantId = as.character(participantId),
        year_mth = format(as.Date(financial_journal_lessdup$timestamp), "%Y-%m",
        amount = abs(round(amount,2)),
        .before = 1)) 

#Create new df by doing a full join with participants_counts
financial_journal_lessdup_lessentries <- full_join(
  financial_journal_lessdup, participant_counts, 
  by = c("participantId" = "participantId"))  %>%
  group_by(participantId) %>%
  #participants_counts contains the number of months the participants entries are in
  mutate(num_months = n_distinct(year_mth)) %>%
  ungroup() %>%
  #filter away participantsId who do not have entries for the full year
  filter(num_months == 12) %>%
  select(participantId, year_mth, category,amount)

#create another df to group by year_mth, category
grouped_data_rev <- financial_journal_lessdup_lessentries %>%
  #group the columns in the following order 
  group_by(year_mth, category) %>%
  summarize(total_amount = sum(amount)) 

#creating interactive graph
plot_ly(data = grouped_data_rev, 
  x = ~year_mth, y = ~total_amount, color = ~category,
  type = 'scatter', mode = 'line',
  hovertemplate = ~paste("Year-Month:", year_mth,
                         "<br>Amount:", format(total_amount, big.mark = ","))) |>

#Configure title and axes
  layout(title = "The Circulation of Money",
         xaxis = list(title = "Time Period"),
         yaxis = list(title = "Amount"))

Observations:

  • Income (~Wage) is fairly constant except for Mar 2022. It can be inferred that the spike in Wage could be a form of bonus, grant, incentives given by the service centre and might not be an one off special event (requires longer time period to determine).

  • Expenses distribution seems to be constant with lesser spending on Education expenses . However, there is a slight fluctuation in Recreation expenses.

4.3.1 Spending Patterns of participants

As observed earlier, there are a slight fluctuation in Recreation expenses. Henceforth, Wage will be removed in the graph to better visualize the expenses.

Plot_ly graph is plotted to visualize the spending patterns.

Show the code
#create a new dataset
grouped_data_rev_new <- financial_journal_lessdup_lessentries %>%
  mutate(amount = abs(round(amount,2))) %>%
  #group the columns in the following order 
  group_by(category,year_mth) %>%
  summarize(total_amount = sum(amount)) 

# Filter out "Wage" category from the data frame
grouped_data_rev_newest <- grouped_data_rev_new %>% 
   filter(category != "Wage")

#creating interactive graph
plot_ly(data = grouped_data_rev_newest, 
  x = ~year_mth, y = ~total_amount, color = ~category,
  type = 'scatter', mode = 'line',
  hovertemplate = ~paste("Year-Month:", year_mth,
                         "<br>Amount:", format(total_amount, big.mark = ","))) |>

#Configure title and axes
  layout(title = "Spending Patterns of participants from Mar 2022 - Feb 2023",
         xaxis = list(title = "Time Period"),
         yaxis = list(title = "Total Expenses"))

Observations:

  • Education remain constant throughout while Shelter decrease in Mar 2022 and remains constant. Both expenses are a fixed expense.
  • Recreation expenses fluctuates more in comparison to Food expenses.

5. Confirmatory Data Analysis Visualization

In this section, we will focus more on the statistical testing that are used in Confirmatory Data analysis. The plots are created with the use of ggbarstats , ggbetweenstats, and gscatterstats packages.

5.1 Association Test between Age group and Education Level

As observed in Section 4.1 - Interactive Dashboard, we noticed that the age-group is left-skewed while the education level is right-skewed. Therefore, we would like to test if there is any association between the two variables. Notably, the association test is non-parametric and thus, does not have to conform to the normality assumption.

At 95% confidence level,

Ho : No association exists between the age group and education level

H1: Association exists between the age group and education level

Show the code
ggbarstats(data = resident_profile_rev, 
           x = educationLevel, y = age_group,
           xlab= "Age Group", ylab = "Education Level",
           title = "Comparison of Education level across Age Group",
           type = "nonparametric", conf.level = 0.95,
           package = "wesanderson", palette = "Chevalier1"
           )

From the graph, we observed that majority of the age group have a HighSchoolOrCollege education level. Likewise, we also note that the interest for further education decreases as the age increases. Interestingly, young adults aged 20-29 have attained a higher education level than any other age group.

From the test result above (p<0.05) , we conclude that there is an association between the age group and education level as we reject the null hypothesis.

5.2 Differences in Joviality based on Education Level/Age Group

As defined earlier, Joviality indicates the participant’s overall happiness at the start of the study. We will like to found out if there is a difference in Joviality based on Education Level. Before testing our hypothesis, we will perform a normality assumption test at 95% confidence level.

5.2.1 Normality Assumption Test

The statistical graph - Normal Quantile Plot (QQ Plot) is used to visualize a normal distribution as it shows the type of distribution. To be normally distributed, the dot points should be scatter very closely to the slope line. However, as seen from bottom left figure, the dot points curve. Therefore, by the data visualization alone, we can visually confirmed that the observed values failed to conform to the normality assumption.

To confirm our data visualization, we perform an Anderson-Darling normality test.

At 95% confidence level:

Ho: the observed distribution resembles normal distribution

H1: the observed distribution failed to resemble normal distribution

  • Show the code
    #Anderson-Darling normality test
    ad_test <- ad.test(resident_profile_rev$joviality)
    
    # QQ plot
    qq <- ggplot(data = resident_profile_rev, 
                 aes(sample = joviality)) +
          stat_qq() +
          stat_qq_line() 
    
    
    # Create a data frame with ad_test result
    table_data <- data.frame(
      Test = "Anderson-Darling",
      Statistic = ad_test$statistic,
      p_value = ad_test$p.value
      )
    
    # Save plot and table as png image
    png("qq_plot_with_table.png")
    grid.arrange(qq, tableGrob(table_data), nrow = 1, 
                 top = text_grob("Normality Quantile Plot with Anderson-Darling Normality Test", size =20))
    
    #display output using knitr 
    knitr::include_graphics("qq_plot_with_table.png")

Based on the result above, we concluded that there is enough statistical evidence to reject the null hypothesis. Since the p-values fall below (p < 0.05), normality is not assumed. Henceforth, we will use the non-parametric test to conduct our analysis.

5.2.2 Kruskal-Wallis Test for Joviality across Education Level

We will test the following hypothesis at 95% Confidence Level:

Ho : the median Joviality across different education level is the same

H1: the median Joviality across different education level is not the same

Show the code
ggbetweenstats(data = resident_profile_rev,
       x= educationLevel, y= joviality, type ="np",
       xlab= "Education Level", ylab = "Joviality",
       title = "Comparison of Joviality across Education Level",
       pairwise.comparisons = TRUE, pairwise.display ="ns", conf.level = 0.95,
       package = "wesanderson", palette = "Chevalier1"
       )

As observed from the test results above, the P-value is lower than the 0.05. As such, there is enough statistical evidence to reject the null hypothesis that the median joviality across education level is the same.

Additionally, we want to find out if there any distinct similarities between the district. From the figure above, we discovered that not all pair comparison are statistically significant. The pair (Low and HighSchoolOrCollege) is not statistically significant with a P-value of 0.08, which is greater than 0.05. Thus, we cannot reject the null hypothesis that there is not differences between the joviality level between the pair.

5.2.3 Plotting Confidence Interval of Point Estimates

stat_pointinterval() of [ggdist] package is used to visualize the distribution of confidence interval by education level. Due to skewness of data, the median point estimate are used. To distinguish between the interval, scale_color_manual() of [ggplot2] is used (Note: color contrast is added) .

Show the code
#plot points and intervals 
ggplot(data = resident_profile_rev, 
       aes(x = educationLevel, y = joviality)) +
  stat_pointinterval(aes(interval_color = after_stat(level)),
                     point_interval = "median_qi",
                     .width = c(0.95,0.99),
                     point_color = "#C93312") +
  labs(title = "Visualizing Confidence Intervals of Median Joviality", 
       x = "Education Level", y = "Joviality") +
#add colors to graph 
  scale_color_manual(values = c("#D3DDDC","#446455"), 
                     aesthetics = "interval_color") +

  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) 

As seen above, the are high uncertainty across the education level, with an increase in median estimate. This suggests that there might be large presence of outliers in each education level.

5.2.4 Kruskal-Wallis Test for Joviality across Age Group

Earlier, we discovered that there is a statistical difference in joviality across Education Level, we would like to examine if the same applies across age group. As per section 5.2.1, normality is not assumed. Thus, we will use non-parametric test to conduct our analysis.

We will test the following hypothesis at 95% Confidence Level:

Ho : the median Joviality across the age group is the same

H1: the median Joviality across the age group is not the same

Show the code
ggbetweenstats(data = resident_profile_rev,
       x= age_group, y= joviality, type ="np",
       xlab= "Age Group", ylab = "Joviality",
       title = "Comparison of Joviality across Age Group",
       pairwise.comparisons = TRUE, pairwise.display ="ns", conf.level = 0.95,
       package = "wesanderson", palette = "Chevalier1"
       )

As observed from the test results above, the P-value is lower than the 0.05. As such, there is enough statistical evidence to reject the null hypothesis that the median joviality across age group is the same.

From the figure above, we discovered that not all pair comparison are statistically significant as there are two pairs with p-value greater than 0.05. The pairs (age group 20-29 and age group 30-39) and (age group 30-39 and age group 40-49) are not statistically significant with a P-value of 0.36 and 0.20 respectively. Thus, we cannot reject the null hypothesis that there is not differences between the joviality level between the pairs.

5.3 Differences in Joviality based on Income

We will like to found out if there is a difference in Joviality based on Income. A plot_ly() graph is use to graph. Opacity is included to highlight the contrast and a hovertemplate is included to reflect the <participantID>, <Cashflow>, <Income>, and Expenses .

Show the code
plot_ly(data = resident_profile_rev, 
             x = ~joviality, y = ~Income,
        hovertemplate = ~paste("<br>Participant's ID:",participantId,
                               "<br>Cashflow:", Cashflow,
                              "<br>Income:", Income,
                               "<br>Expenses:", Expenses),
             
            type = "scatter",
            mode = "markers",
            marker = list(opacity = 0.7,sizemode = "diameter", 
                          line = list(width =0.1, color = "white"))) |>

#add title and labels to axis 
        layout(title = "Income vs Joviality" ,
         xaxis = list(title = "Joviality level") ,
         yaxis = list(title = "Income"))

Observations:

  • Joviality level decreases as income level increases. It decreases sharply when income exceeds $15k.

  • Similar to income range of $10-$15k where majority have low joviality

  • Joviality level are more spread out when income ranges below $5k

5.3.1 Correlation Analysis between Joviality and Income

Since joviality is not normally distributed, a ggscatterstats() from [ggstatsplot] is used to build a visual for Significant Test of Correlation between Joviality and Income.

Show the code
ggscatterstats(
  data = resident_profile_rev,
  x = joviality, y = Income,
  type = "nonparametric", marginal = TRUE,
  title = "Significant Test of Correlation between Joviality and Income",
  xlab = "Joviality",
  ylab = "Income"
)

The test result of The Spearman correlation coefficient indicates that the there is a weak negative linear relationship between Joviality and Income. Even though the correlation is on the weaker end, we can conclude that there is an inverse relationship between them. The higher you earn, the lower your joviality.

Correlation Coefficient

To interpret the values of correlation (-1 <= x<= 1),

  • ‘+’ reflects a positive linear relationship

  • ‘-’ reflects a negative linear relationship

  • A value of <0.3 is weak, 0.5 is moderate, >0.8 is strong, and 1 is perfect

5.4 Factors affecting Cash flow

Notably, the circulation of money is crucial for City of Engagement and could potentially impact the Local Plan 2023 that is in the midst of preparation. Henceforth, we would like to into the correlation between income and expenses across education level.

5.4.1 Cash flow vs Income across Education Level

Multiple ggscatterstats() from [ggstatsplot] is used to build a visual for Significant Test of Correlation between Cashflow and Income by filtering the education level. patchwork is included to combine the graphs.

Show the code
#plotting correlation between cashflow and income across education level
edu_low <- ggscatterstats(data = resident_profile_rev |> 
                           filter(educationLevel == "Low"), 
                           x = Income, y = Cashflow,
                           type = "nonparametric") + 
           theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
           labs(title = "Low", 
           x = "Income", y = "Cash flow") +
           scale_y_continuous(labels = comma_format()) 

edu_hc <- ggscatterstats(data = resident_profile_rev |> 
                           filter(educationLevel == "HighSchoolOrCollege"), 
                           x = Income, y = Cashflow,
                           type = "nonparametric") + 
          theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
          labs(title = "High School", 
          x = "Income", y = "Cash flow") +
          scale_y_continuous(labels = comma_format()) 

edu_bach <- ggscatterstats(data = resident_profile_rev |> 
                           filter(educationLevel == "Bachelors"), 
                           x = Income, y = Cashflow,
                           type = "nonparametric") + 
            theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
            labs(title = "Bachelors", 
            x = "Income", y = "Cash flow") +
            scale_y_continuous(labels = comma_format()) 

edu_grad <- ggscatterstats(data = resident_profile_rev |> 
                           filter(educationLevel == "Graduate"), 
                           x = Income, y = Cashflow,
                           type = "nonparametric") + 
            theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
            labs(title = "Graduate", 
            x = "Income", y = "Cash flow") +
            scale_y_continuous(labels = comma_format()) 

#combined plot and ensure layout is in order
corr_edu <- ((edu_low + edu_hc) / (edu_bach + edu_grad) + plot_spacer())

#add labels
corr_edu + plot_annotation(title = "Correlation between Cashflow and Income", 
                           subtitle = "High correlation between Cashflow and Income at all education level",
                           theme = theme(
                             plot.title = element_text(size = 18),
                             plot.subtitle = element_text(size = 12)))

As seen above, there is an almost perfect positive linear relationship (>0.9) between Cash flow and Income across Education level. The Spearman correlation coefficient also increases at different education level.

5.4.2 Cash flow vs Expenses across Education Level

Similar approach to the previous section.

Show the code
#plotting correlation between expenses and cashflow across education level
edu_low_ex <- ggscatterstats(data = resident_profile_rev |> 
                           filter(educationLevel == "Low"), 
                           x = Expenses, y = Cashflow,
                           type = "nonparametric") + 
           theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
           labs(title = "Low", 
           x = "Expenses", y = "Cash flow") +
           scale_y_continuous(labels = comma_format()) 

edu_hc_ex <- ggscatterstats(data = resident_profile_rev |> 
                           filter(educationLevel == "HighSchoolOrCollege"), 
                           x = Expenses, y = Cashflow,
                           type = "nonparametric") + 
          theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
          labs(title = "High School", 
          x = "Expenses", y = "Cash flow") +
          scale_y_continuous(labels = comma_format()) 

edu_bach_ex <- ggscatterstats(data = resident_profile_rev |> 
                           filter(educationLevel == "Bachelors"), 
                           x = Expenses, y = Cashflow,
                           type = "nonparametric") + 
            theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
            labs(title = "Bachelors", 
            x = "Expenses", y = "Cash flow") +
            scale_y_continuous(labels = comma_format()) 

edu_grad_ex <- ggscatterstats(data = resident_profile_rev |> 
                           filter(educationLevel == "Graduate"), 
                           x = Expenses, y = Cashflow,
                           type = "nonparametric") + 
            theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
            labs(title = "Graduate", 
            x = "Expenses", y = "Cash flow") +
            scale_y_continuous(labels = comma_format()) 

#combined plot and ensure layout is in order
corr_edu <- ((edu_low_ex + edu_hc_ex) / (edu_bach_ex + edu_grad_ex) + 
               plot_spacer())

#add labels
corr_edu + plot_annotation(title = "Correlation between Cashflow and Expenses", 
                           subtitle = "Weak-to-moderate correlation between Cashflow and Expenses at all education level",
                           theme = theme(
                             plot.title = element_text(size = 18),
                             plot.subtitle = element_text(size = 12)))

As seen above, there is a negative linear relationship between Cash flow and Expenses across Education level. Low Education Level and Bachelors Education level have a weak-to-moderate correlation while HighSchoolOrCollege and Graduate have a weak negative linear relationship. Although the correlation is weak, we can still infer that the cashflow increases when expenses decreases.

5.4.3 Correlation between Income and Expenses (Various Categories)

Notably, we observed fluctuations in Recreation and Food expenses. Henceforth, we could like to know if there are any correlation between Income and various categories. Similarly, a ggscatterstats() from [ggstatsplot] is used to build a visual for Significant Test of Correlation between Income and Various Categories.

Show the code
#plotting correlation between income and various expenses
in_food <- ggscatterstats(data = resident_profile_rev,
                           x = Food, y = Income,
                           type = "nonparametric") + 
           theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
           labs(title = "Food", 
           x = "Food", y = "Income") +
           scale_y_continuous(labels = comma_format()) 

in_recreation <- ggscatterstats(data = resident_profile_rev,
                           x = Recreation, y = Income,
                           type = "nonparametric") + 
           theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
           labs(title = "Recreation", 
           x = "Recreation", y = "Income") +
           scale_y_continuous(labels = comma_format())

in_shelter <- ggscatterstats(data = resident_profile_rev,
                           x = Shelter_new, y = Income,
                           type = "nonparametric") + 
           theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
           labs(title = "Shelter", 
           x = "Shelter", y = "Income") +
           scale_y_continuous(labels = comma_format()) 

in_education <- ggscatterstats(data = resident_profile_rev,
                           x = Education, y = Income,
                           type = "nonparametric") + 
           theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
           labs(title = "Education", 
           x = "Education", y = "Income") +
           scale_y_continuous(labels = comma_format()) 

#combined plot and ensure layout is in order
corr_income <- ((in_recreation + in_food) / (in_shelter + in_education) + 
               plot_spacer())

#add labels
corr_income + plot_annotation(title = "Correlation between Income and various Expenses", 
                           subtitle = "Weak correlation between Income and various Expenses at all education level", 
                           theme = theme(
                             plot.title = element_text(size = 18),
                             plot.subtitle = element_text(size = 12)))

From the results above, we noticed that all categories have a weak negative linear relationship with Income. Despite the fact that Education and Shelter belongs to fixed expenses whereas Recreation and Food are variable expenses, the correlation coefficient results prove otherwise.

6. Conclusion

From the above analysis, we observed the following:

  • City of Engagement might be facing an aging population.

  • Association exists between the age group and education level. The interest for further education decreases as the age increases.

  • Majority of participants have positive cash flow.

  • Joviality level varies across Education Level and Age Group with high uncertainty. In comparison to Income, we note that Joviality level decreases as Income increases. However, we found that there is a weak negative linear relationship between both variables.

  • With regards to factors affecting cashflow, Income across Education level have close to perfect positive linear relationship (>0.9). The Spearman correlation coefficient also increases at different education level. However, it has a weak-to-moderate negative linear relationship with Expenses. Likewise, there is a weak negative linear relationship between Income and Various Expenses.